Concatenative Mandarin TTS Accommodating Isolated English Words

Authors

  • Zhenli YU
  • Dongjian YUE
  • Jian-Cheng HUANG
Abstract

An experiment exploring a method for realizing a concatenative Chinese TTS that accommodates isolated English words is presented. The experiment was based on an existing concatenative Mandarin TTS system developed at Motorola China Research Center. The experimental system employs an English word synthesizer based on the concatenation of speech segments stored in an English corpus. The original English corpus contains isolated words uttered by a professional English speaker. The Chinese speaker who uttered the Mandarin TTS speech corpus uttered the same set of English words. The English word corpus uttered by the English speaker was then modified using the English word corpus uttered by the Mandarin speaker, on a word-by-word basis. A voice conversion technique is applied to modify the English word corpus. The voice conversion is focused on the voiced phones. The conversion process is basically a pitch scaling and a spectral envelope scaling based on a phone-level average.

1. INTRODUCTION

Recent Chinese TTS systems tend to accommodate Chinese text mixed with isolated English words. The typical way of doing so is to let the system alternate between two acoustically independent engines, one serving the Chinese text and the other serving the English words. An obvious drawback of this approach is that the alternating voices can produce an unpleasant audio effect. In this paper, we present a new method that tackles the voice discontinuity problem using a voice conversion technique. The proposed method combines a concatenative Mandarin TTS system with an English isolated word synthesizer. The voice of the English synthesizer is converted to the voice of the Mandarin system, allowing users to perceive a single speaker's voice in both the Chinese and the English read-out.

Our experiment was based on the existing concatenative Mandarin TTS system, MRhyker, developed at Motorola China Research Center [1]. Accompanying MRhyker is an English word synthesizer based on the concatenation of speech segments stored in an English corpus. Originally, the English speech corpus contained about 2,000 words (referred to as the EE-corpus) uttered by a professional English speaker. The Chinese speaker who uttered the MRhyker speech corpus uttered the same set of words used in the EE-corpus, creating a corpus called the CE-corpus. The EE-corpus was then modified using the CE-corpus, on a word-by-word basis. A voice conversion technique is applied to modify the EE-corpus using the CE-corpus. Considering that speaker characteristics are mainly conveyed in the voiced part of speech, the conversion is needed only on the voiced phones (V-phones). The conversion process is basically a pitch scaling and a spectral envelope scaling based on a phone-level average.

To demonstrate the conversion effect, we applied parametric speech coding to the speech segments in all corpora. The coding scheme used in the experiment is harmonic vector excitation coding (HVXC) [2]. We detected V-phones in the EE-corpus and the CE-corpus, respectively, and grouped them into pairs on a word-by-word and phone-by-phone basis. The mean LSF vector, mean pitch value, and spectral mean of the LP residual signal of the V-phone pairs are analyzed. The V-phones of the EE-corpus are then modified by scaling their mid-term mean values so that they have the same mid-term means as their counterparts in the CE-corpus. As a result, the modified V-phones keep the temporal dynamics of the English speaker while sounding as if uttered by the Mandarin speaker.
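The paper describes this mid-term mean scaling only at the level above. As a rough numerical illustration of one possible reading of it, the sketch below matches the phone-level mean pitch and mean LSF vector of an EE-V-phone to those of its paired CE-V-phone while keeping the frame-by-frame contour. The function names, the simple ratio and offset operations, and the assumption that per-frame pitch and LSF values are already available from an HVXC-style analysis are ours, not the paper's exact formulas.

```python
import numpy as np

# Illustrative mean-matching sketch (our reading of the phone-level average
# scaling, not the paper's exact formulas). Inputs are per-frame features of
# one paired V-phone, assumed already extracted by an HVXC-style analyzer.

def scale_pitch(pitch_ee, pitch_ce):
    """Scale the EE-V-phone pitch contour so its phone-level mean equals the
    CE-V-phone's mean, preserving the English speaker's temporal dynamics."""
    pitch_ee = np.asarray(pitch_ee, dtype=float)   # shape (T_e,), Hz
    pitch_ce = np.asarray(pitch_ce, dtype=float)   # shape (T_c,), Hz
    voiced_ee = pitch_ee[pitch_ee > 0]             # ignore unvoiced frames (pitch = 0)
    voiced_ce = pitch_ce[pitch_ce > 0]
    if voiced_ee.size == 0 or voiced_ce.size == 0:
        return pitch_ee
    ratio = voiced_ce.mean() / voiced_ee.mean()    # phone-level mean ratio
    return np.where(pitch_ee > 0, pitch_ee * ratio, pitch_ee)

def shift_lsf(lsf_ee, lsf_ce):
    """Shift the EE-V-phone LSF frames so their mean vector equals the
    CE-V-phone's mean LSF vector (one simple form of spectral envelope
    scaling based on a phone-level average)."""
    lsf_ee = np.asarray(lsf_ee, dtype=float)       # shape (T_e, lpc_order), radians
    lsf_ce = np.asarray(lsf_ce, dtype=float)       # shape (T_c, lpc_order), radians
    shifted = lsf_ee + (lsf_ce.mean(axis=0) - lsf_ee.mean(axis=0))
    # keep the LSFs ordered and inside (0, pi) so the LP filter stays stable
    return np.clip(np.sort(shifted, axis=1), 1e-3, np.pi - 1e-3)
```

In this reading, the ratio scaling of the pitch contour and the mean shift of the LSF frames move the long-term speaker cues toward the Mandarin speaker, while the frame-to-frame variation, i.e. the English speaker's temporal dynamics, is left untouched.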
The experiment shows some promise that a single-speaker bilingual TTS can be realized even though a real single-speaker bilingual corpus is not available. In particular, because MRhyker is an embeddable concatenative Chinese TTS system, the present method offers a promising solution for embedded single-speaker bilingual TTS.

2. THE EXPERIMENT FRAMEWORK

The experiment was based on the existing concatenative Mandarin TTS system, MRhyker [1], developed at Motorola China Research Center. The entire experimental flowchart consists of four major parts: an English/Chinese separator, a Chinese concatenative acoustic parameter array generator, an isolated English word acoustic parameter array generator, and a concatenative waveform synthesizer. The flowchart is shown in figure-1. The embedded English words are detected in the input text by the English/Chinese separator module. The Chinese text is processed to select acoustic units following the major procedures of MRhyker, which yields the acoustic parameter arrays for concatenating the Chinese text. The acoustic parameter arrays for the English words are selected in parallel. At the concatenation stage, the acoustic parameter arrays of the Chinese text and of the isolated English words are concatenated, and the output waveform of the full text is then synthesized with the concatenative waveform synthesizer (a rough illustrative code sketch of this routing is given below).

Figure-1. The experimental flowchart

3. VOICE CONVERSION OF ENGLISH WORD CORPUS

3.1 English word corpora

Originally, the English speech corpus contained about 2,000 of the most frequently used isolated words (referred to as the EE-corpus) uttered by a professional English speaker. The Chinese speaker who uttered the MRhyker speech corpus uttered the same set of words used in the EE-corpus, creating a corpus called the CE-corpus. Both the CE-corpus and the EE-corpus are segmented at the phone level. Although the Chinese speaker was required to utter the same set of words as the English speaker did, the English speaking quality of the Chinese speaker may not be as good as that of the professional English speaker. To allow users to perceive the read-out of both the Chinese text and the English words as spoken by a single, professional speaker, a voice conversion technique is applied to modify the EE-corpus using the speaker features of the CE-corpus.

3.2 Speaker feature modification

Assuming that speaker characteristics are mainly conveyed in the voiced portion of speech, the conversion is needed only on the voiced phones (V-phones). The conversion process is basically a pitch scaling and a spectral envelope scaling based on a phone-level average. To demonstrate the conversion effect, we applied parametric speech coding to the speech segments in all corpora. The coding scheme used in the experiment is harmonic vector excitation coding (HVXC). V-phones are detected in the EE-corpus and the CE-corpus, respectively, and grouped into pairs on a phone-by-phone basis (referred to as CE-V-phones and EE-V-phones). Spectral analysis and pitch estimation are conducted with the analysis module of HVXC. The glottal wave (residual signal) is also estimated (as shown in figure-2).

3.2.1 Pitch modification

For a V-phone pair, the EE-V-phone and the CE-V-phone are segmented into $T_e$ and $T_c$ frames, respectively. The pitch trajectories of the V-phone pair are expressed as

$P_t^e,\ t = 1, \ldots, T_e \qquad (1)$
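The paper does not specify how the English/Chinese separator or the engine interfaces work. The sketch referred to in Section 2 above, given here, only illustrates the routing idea with a simple heuristic (runs of ASCII letters are treated as isolated English words) and placeholder engine objects; names such as split_mixed_text and select_units are hypothetical and are not part of MRhyker.

```python
import re

# Illustrative routing sketch for the framework of Section 2 (hypothetical
# names; MRhyker's real interfaces are not described in the paper).
ENGLISH_WORD = re.compile(r"[A-Za-z][A-Za-z'\-]*")

def split_mixed_text(text):
    """Yield ('en', word) for isolated English words and ('zh', chunk) for the
    surrounding Chinese text, in reading order."""
    pos = 0
    for match in ENGLISH_WORD.finditer(text):
        if match.start() > pos:
            yield ("zh", text[pos:match.start()])
        yield ("en", match.group())
        pos = match.end()
    if pos < len(text):
        yield ("zh", text[pos:])

def build_parameter_arrays(text, mandarin_engine, english_engine):
    """Collect acoustic parameter arrays from both generators in text order;
    a concatenative waveform synthesizer would render the result in one pass."""
    arrays = []
    for kind, chunk in split_mixed_text(text):
        if kind == "en":
            arrays.append(english_engine.select_units(chunk))    # placeholder call
        elif chunk.strip():
            arrays.append(mandarin_engine.select_units(chunk))   # placeholder call
    return arrays

# Example of the separator alone:
# list(split_mixed_text("请打开 Bluetooth 设置"))
# -> [('zh', '请打开 '), ('en', 'Bluetooth'), ('zh', ' 设置')]
```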




Similar articles

Syllable HMM based Mandarin TTS and comparison with concatenative TTS

This paper introduces a Syllable HMM based Mandarin TTS system. 10-state left-to-right HMMs are used to model each syllable. We leverage the corpus and the front end of a concatenative TTS system to build the Syllable HMM based TTS system. Furthermore, we utilize the unique consonant/vowel structure of Mandarin syllable to improve the voiced/unvoiced decision of HMM states. Evaluation results s...


An NN-based Approach to Prosodic for Synthesizing English Words Em

In this paper, a neural network-based approach to generating proper prosodic information for spelling/reading English words embedded in background Chinese texts is discussed. It expands an existing RNN-based prosodic information generator for Mandarin TTS to an RNN-MLP scheme for Mandarin-English mixed-lingual TTS. It first treats each English word as a Chinese word and uses the RNN, trained fo...


SSML Extensions Aimed To Improve Asian Language TTS Rendering

Both formant synthesis based and concatenative acoustic unit based TTS systems have been developed at Nokia. Many non-English languages have been considered in the development work, and Nokia's Mandarin Chinese TTS system is under continuous development within the TC-STAR framework (www.tc-star.org). To meet the needs of the TTS evaluations in TC-STAR, common interfaces for the input and all t...


Data pruning approach to unit selection for inventory generation of concatenative embeddable Chinese TTS systems

In this paper, a data pruning approach is presented for building acoustic unit inventory for syllable-based concatenative embeddable Chinese TTS system. A 3-portion segmentation of a syllable is proposed based on the nature of voiced/unvoiced structure of Chinese syllable. Individual factorial acoustic measurement of syllable is used to calculate the penalty of perceptual unsatisfactory for con...


Spectral Continuity Measures at Mandarin Syllable Boundaries

In Text-to-Speech (TTS) systems based on concatenative synthesis, the naturalness of synthetic speech is highly affected by the spectral continuities at the concatenation point. In this paper, we focused on 4 kinds of syllable boundaries in Mandarin and used several spectral distance measures combined with time-derivative distance measures to predict their audible discontinuities. A perceptual...




Publication date: 2002